Skip to content

fix(action_status_bridge): retry deferred fault reports instead of dropping them#472

Merged
bburda merged 3 commits into
mainfrom
fix/action-status-bridge-deferred-fault
Jun 23, 2026
Merged

fix(action_status_bridge): retry deferred fault reports instead of dropping them#472
bburda merged 3 commits into
mainfrom
fix/action-status-bridge-deferred-fault

Conversation

@bburda

@bburda bburda commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator

Pull Request

Summary

Two fixes (see commits):

1. action_status_bridge: don't lose / promptly deliver faults when the FaultManager service isn't discovered yet (closes #471)

The bridge silently lost the first fault of a failing action when the FaultManager's report_fault service was not yet discovered at the instant the bridge observed the terminal status (startup discovery window; exposed by DDS timing - failing on Humble, passing on Jazzy).

Root cause: on the healthy -> failed transition apply_message() called the fire-and-forget reporter->report() and then committed last_reported_state_ = kFailed unconditionally. If the service was not ready, FaultReporter dropped the request - but the bridge had already recorded the action as reported. The latched (transient_local) status is delivered to a subscription exactly once and a rescan skips already-subscribed topics, so the dropped report was never retried and the high-severity fault was lost permanently.

Fix:

  • Track the latest observed action state (desired_state_) separately from what the FaultManager was told (last_reported_state_); deliver via reconcile() only when they differ and the report is deliverable (service ready; null reporter = unit-test seam), committing only after delivery.
  • Deliver as fast as the discovery floor allows: the FaultManager captures its freeze-frame snapshot when it receives the report, so delivery latency makes that snapshot stale. The happy path reports synchronously in the status callback; a deferred report is retried on a dedicated fast retry timer (retry_period_sec, default 50 ms, armed only while pending), decoupled from the slow discovery rescan, so it lands within one short tick of the service appearing. The executor is deliberately not blocked. The README documents this delivery-latency model and its level-triggered boundaries.
  • reconcile() holds state_mutex_ across gate -> report -> commit so the status callback and retry timer can't double-report under a multi-threaded executor.
  • The prune-vanished heal is gated on service readiness and warns (instead of silently dropping) when a vanished action's heal can't be delivered.
  • The shared FaultReporter fire-and-forget contract is unchanged.

2. gateway: fix flaky SSE handler test under ASan

SSEFaultHandlerTest.StreamSnapshotsEntityContextAtEnqueue flaked under AddressSanitizer. enqueue_event() spun for a fixed budget after publishing; under sanitizer slowdown the handler's on_fault_event (which snapshots the owning entity) could run after that budget, so the test cleared the cache before the snapshot was taken and the replayed event omitted x-medkit. Added SSEFaultHandler::events_received() (a monotonic counter, like connected_clients()) and made enqueue_event() wait until the handler has actually consumed the event. Test-only timing fix; no product behavior change.


Issue


Type

  • Bug fix
  • New feature or tests
  • Breaking change
  • Documentation only

Testing

  • action_status_bridge: 31 unit tests pass (incl. deferred raise/heal delivered once the service is ready, and the retry pass not dropping while the service is down); the test_integration launch test passes (raise + heal); linters + clang-tidy clean. CI Pixi (humble) (originally failing) and Humble/Jazzy/Lyrical/TSan green on the prior commits.
  • gateway SSE: full test_sse_fault_handler (19 tests) passes; the snapshot test passes 30/30 in a repeat loop; clang-format + clang-tidy clean on the changed files.

Checklist

  • Breaking changes are clearly described (and announced in docs / changelog if needed)
  • Tests were added or updated if needed
  • Docs were updated if behavior or public API changed

…opping them

When a failing action's terminal status reaches the bridge before the
FaultManager report service is discovered (e.g. a latched transient_local
status delivered during the startup discovery window), the fire-and-forget
report() is dropped. The bridge previously committed last_reported_state_
unconditionally, so the dropped high-severity fault was never retried: the
latched status is delivered to a subscription only once, and a rescan skips
already-subscribed topics, so no second message ever re-drives the report.

Track the latest observed state (desired) separately from what the FaultManager
was actually told (reported), and commit the transition only once the report is
deliverable (a null reporter is the unit-test seam; a real reporter must have a
discovered service). reconcile_pending() re-attempts undelivered transitions on
the existing rescan tick, so a report dropped during the discovery window is
delivered as soon as the service appears. The "Action ... -> fault" log now
fires only on actual delivery rather than after a silent drop.

Closes #471
Copilot AI review requested due to automatic review settings June 23, 2026 13:36

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a startup discovery race in ros2_medkit_action_status_bridge where a terminal (latched) ABORTED/CANCELED action status could be observed before the FaultManager report_fault service was discovered, causing the bridge to “commit” the failure locally even though FaultReporter dropped the report—permanently losing the fault.

Changes:

  • Split “latest observed” per-action state (desired_state_) from “successfully delivered to FaultManager” state (last_reported_state_), and only commit after delivery via a new reconcile() path.
  • Add reconcile_pending() on the existing rescan tick to retry deferred raises/heals once the service becomes ready.
  • Add a regression unit test ensuring deferred faults remain pending (not committed/dropped) when the FaultManager service is unavailable; update README to document the behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
src/ros2_medkit_action_status_bridge/src/action_status_bridge_node.cpp Implements desired-vs-reported reconciliation and rescan-driven retries for deferred fault delivery.
src/ros2_medkit_action_status_bridge/include/ros2_medkit_action_status_bridge/action_status_bridge_node.hpp Documents the new reconcile model and adds DesiredState / desired_state_ tracking.
src/ros2_medkit_action_status_bridge/test/test_action_status_bridge.cpp Adds a regression test covering “service unavailable => fault stays pending” behavior.
src/ros2_medkit_action_status_bridge/README.md Documents deferred delivery + retry semantics during startup discovery.

@bburda bburda self-assigned this Jun 23, 2026
SSEFaultHandlerTest.StreamSnapshotsEntityContextAtEnqueue flaked under
AddressSanitizer: enqueue_event() spun for a fixed budget (20 x 10 ms) after
publishing a FaultEvent, but under sanitizer slowdown on_fault_event - which
snapshots the owning entity - could run after that budget. The test then
cleared the cache before the snapshot was taken, so the replayed event omitted
x-medkit and the assertion failed.

Add SSEFaultHandler::events_received() (a monotonic processed-event counter,
mirroring connected_clients()) and make enqueue_event() spin until the handler
has actually consumed the event before returning. The entity snapshot is then
always taken while the App is still in the cache, independent of execution
speed. No product behavior change; entity resolution stays synchronous in
on_fault_event.
…t retry timer

A fault deferred because the FaultManager service was not yet discovered was only
retried on the slow rescan tick (default 2s). The FaultManager captures the
freeze-frame snapshot when it receives the report, so that latency makes the
snapshot stale - a late fault is reported against context that has already moved
on. Deliver as soon as the report channel is available instead.

- Add a dedicated fast retry timer (retry_period_sec, default 50 ms), decoupled
  from the discovery rescan and armed only while a delivery is pending, so a
  deferred report lands within one short tick of the service appearing. The
  happy path (service discovered) still reports synchronously in the status
  callback.
- Hold state_mutex_ across the reconcile gate, report() and commit so the status
  callback and the retry timer cannot double-report under a multi-threaded
  executor; report()/report_passed() are async so the section stays short.
- Gate the prune-vanished heal on service readiness and warn when a vanished
  action's heal cannot be delivered, rather than dropping it silently.
- Tests: deferred raise and heal are delivered once the service is ready, and
  the retry pass keeps retrying without dropping while the service is down.
- Document the delivery-latency model and its level-triggered boundaries in the
  README; add retry_period_sec to config and the parameter table.
@bburda bburda merged commit 243a2f4 into main Jun 23, 2026
18 of 20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] action_status_bridge loses first fault when FaultManager service not yet discovered (startup race)

3 participants